160 research outputs found
The importance of better models in stochastic optimization
Standard stochastic optimization methods are brittle, sensitive to stepsize
choices and other algorithmic parameters, and they exhibit instability outside
of well-behaved families of objectives. To address these challenges, we
investigate models for stochastic minimization and learning problems that
exhibit better robustness to problem families and algorithmic parameters. With
appropriately accurate models---which we call the aProx family---stochastic
methods can be made stable, provably convergent and asymptotically optimal;
even modeling that the objective is nonnegative is sufficient for this
stability. We extend these results beyond convexity to weakly convex
objectives, which include compositions of convex losses with smooth functions
common in modern machine learning applications. We highlight the importance of
robustness and accurate modeling with a careful experimental evaluation of
convergence time and algorithm sensitivity.
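For objectives known to be nonnegative, the truncated model max(f(x) + ⟨f'(x), y - x⟩, 0) yields a closed-form update: an ordinary SGD step whose stepsize is clipped at the point where the linear model hits zero. A minimal Python sketch under that assumption (the toy loss and data are hypothetical, not from the paper):

```python
import numpy as np

def truncated_sgd_step(x, f_val, grad, alpha):
    """One aProx-style truncated-model step for a nonnegative objective.

    Minimizes max(f_val + grad.(y - x), 0) + ||y - x||^2 / (2 * alpha),
    whose minimizer is an SGD step with the stepsize clipped at
    f_val / ||grad||^2 (where the linear lower bound reaches zero).
    """
    g2 = float(np.dot(grad, grad))
    if g2 == 0.0:
        return x
    return x - min(alpha, f_val / g2) * grad

# Toy problem (hypothetical data): f(x) = |a.x - b|, a nonnegative loss.
rng = np.random.default_rng(0)
a, b = rng.normal(size=3), 0.5
x = np.zeros(3)
for k in range(1, 201):
    r = float(a @ x) - b
    x = truncated_sgd_step(x, abs(r), np.sign(r) * a, alpha=1.0 / np.sqrt(k))
print(abs(float(a @ x) - b))  # residual driven to (numerical) zero
```

The clipping is what makes the method insensitive to the stepsize choice alpha: an overly large alpha is harmlessly truncated rather than causing the iterates to diverge.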
Mean Estimation from Adaptive One-bit Measurements
We consider the problem of estimating the mean of a normal distribution under
the following constraint: the estimator can access only a single bit from each
sample from this distribution. We study the squared error risk in this
estimation as a function of the number of samples and one-bit measurements.
We consider an adaptive estimation setting where the single bit sent at each
step is a function of both the new sample and the previously acquired bits.
For this setting, we show that no estimator can attain asymptotic mean squared
error smaller than π/2 times the variance of the sample mean. In other words,
the one-bit restriction increases the number of samples required for a
prescribed accuracy of estimation by a factor of at least π/2 compared to the
unrestricted case. In addition, we provide an explicit estimator that attains
this asymptotic error, showing that, rather surprisingly, only π/2 times
more samples are required in order to attain estimation performance equivalent
to the unrestricted case.
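One way to realize such an adaptive scheme is a Robbins-Monro recursion that compares each sample to the running estimate. This is a standard construction consistent with the abstract, not necessarily the paper's exact estimator, and it assumes the variance is known:

```python
import numpy as np

def adaptive_one_bit_mean(samples, sigma):
    """Estimate the mean from one sign bit per sample (Robbins-Monro sketch).

    Bit i records only whether sample i exceeds the current estimate; the
    1/i gain with constant sigma * sqrt(pi/2) is the classical choice whose
    asymptotic MSE for normal data is (pi/2) * sigma^2 / n.
    """
    theta = 0.0
    c = sigma * np.sqrt(np.pi / 2.0)
    for i, xi in enumerate(samples, start=1):
        bit = 1.0 if xi >= theta else -1.0  # the only bit the estimator sees
        theta += (c / i) * bit
    return theta

rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.0
est = adaptive_one_bit_mean(rng.normal(mu, sigma, 200_000), sigma)
print(est)  # close to mu
```

Each bit depends on all previously acquired bits only through the current estimate theta, so the scheme fits the adaptive setting described above.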
Mean Estimation from One-Bit Measurements
We consider the problem of estimating the mean of a symmetric log-concave
distribution under the constraint that only a single bit per sample from this
distribution is available to the estimator. We study the mean squared error as
a function of the sample size (and hence the number of bits). We consider three
settings: first, a centralized setting, where an encoder may release n bits
given a sample of size n, and for which there is no asymptotic penalty for
quantization; second, an adaptive setting in which each bit is a function of
the current observation and previously recorded bits, where we show that the
optimal relative efficiency compared to the sample mean is precisely the
efficiency of the median; lastly, we show that in a distributed setting where
each bit is only a function of a local sample, no estimator can achieve optimal
efficiency uniformly over the parameter space. We additionally complement our
results in the adaptive setting by showing that a single round of adaptivity
is sufficient to achieve the optimal mean squared error.
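The quantitative content of the median comparison is easy to check numerically; a quick Monte Carlo sketch (the sample sizes below are arbitrary choices, not from the paper):

```python
import numpy as np

# Monte Carlo sketch: for normal data the sample median's variance is
# asymptotically pi/2 times the sample mean's, i.e. the median's relative
# efficiency is 2/pi, matching the adaptive one-bit limit described above.
rng = np.random.default_rng(2)
reps, n = 4000, 1001
draws = rng.normal(0.0, 1.0, size=(reps, n))
ratio = np.median(draws, axis=1).var() / draws.mean(axis=1).var()
print(ratio)  # close to pi/2 ~ 1.571
```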
Distributed Delayed Stochastic Optimization
We analyze the convergence of gradient-based optimization algorithms that
base their updates on delayed stochastic gradient information. The main
application of our results is to the development of gradient-based distributed
optimization algorithms where a master node performs parameter updates while
worker nodes compute stochastic gradients based on local information in
parallel, which may give rise to delays due to asynchrony. We take motivation
from statistical problems where the size of the data is so large that it cannot
fit on one computer; with the advent of huge datasets in biology, astronomy,
and the internet, such problems are now common. Our main contribution is to
show that for smooth stochastic problems, the delays are asymptotically
negligible and we can achieve order-optimal convergence results. In application
to distributed optimization, we develop procedures that overcome communication
bottlenecks and synchronization requirements. We show n-node architectures
whose optimization error in stochastic problems---in spite of asynchronous
delays---scales asymptotically as O(1/√(nT)) after T iterations.
This rate is known to be optimal for a distributed system with n nodes even
in the absence of delays. We additionally complement our theoretical results
with numerical experiments on a statistical machine learning task.
Comment: 27 pages, 4 figures
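A minimal sketch of the delayed-update idea, with a fixed delay standing in for asynchronous workers (the objective, stepsizes, and names are illustrative, not the paper's exact protocol):

```python
import numpy as np
from collections import deque

def delayed_sgd(noisy_grad, x0, steps, delay, stepsize):
    """SGD sketch in which each update uses a gradient evaluated at the
    parameters from `delay` steps earlier, standing in for asynchrony
    between a master node and its workers."""
    history = deque([x0] * (delay + 1), maxlen=delay + 1)
    x = x0
    for t in range(1, steps + 1):
        stale = history[0]              # parameters from `delay` steps ago
        x = x - stepsize(t) * noisy_grad(stale)
        history.append(x)
    return x

# Toy smooth stochastic problem (hypothetical): minimize E[0.5*||x - Z||^2]
# with Z ~ N(mu, I), whose minimizer is mu.
rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
x_final = delayed_sgd(lambda x: x - rng.normal(mu, 1.0),
                      np.zeros(2), steps=50_000, delay=10,
                      stepsize=lambda t: 1.0 / np.sqrt(t))
print(x_final)  # approaches mu despite stale gradients
```

With the 1/√t stepsizes, the staleness shrinks the same way the noise does, which is the intuition behind the delays becoming asymptotically negligible.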
- …